Formant tracking using segmental phonemic information
Abstract
A new formant tracking algorithm using phoneme-dependent nominal formant values is tested. The algorithm consists of three phases: (1) analysis, (2) segmentation, and (3) formant tracking. In the analysis phase, formant candidates are obtained by solving for the roots of the linear prediction polynomial. In the segmentation phase, the input text is converted into a sequence of phonemic symbols, and the sequence is then time aligned with the speech utterance. Finally, a set of formant candidates that are close to the nominal formant estimates while satisfying the continuity constraints is chosen. The new algorithm significantly reduces the formant tracking error rate (3.62%) compared with a formant tracking algorithm that uses only continuity constraints (13.04%). We will also discuss how to further reduce the tracking error rate.

INTRODUCTION

In the Bell Labs' Text-To-Speech (TTS) system [1], a limited number of acoustic units is stored in the inventory table. It is therefore important to be able to choose the best candidate for each synthesis unit (diphone, triphone, etc.). Formant values can be used for selecting the best units as well as for testing unit compatibility, i.e., determining whether any two synthesis units are connectable in terms of spectral discrepancy [1]. Thus, reliable formant tracking is a crucial component in TTS system construction, where a huge amount of speech data has to be processed. Given the size of the speech corpus, it would be prohibitive to rely on human intervention to correct formant tracking errors.

For decades, researchers have worked to improve the performance of speech formant tracking algorithms. Nevertheless, state-of-the-art formant tracking algorithms are not reliable enough for unsupervised, automatic use. Even though the errors are obvious to the human eye when displayed over a longer time frame, a human might not do much better than an automatic formant tracker given only local information. This observation has led to methods that impose continuity constraints on the formant selection process [2], [3]. However, these methods still tend to generate errors by enforcing the continuity constraints too strongly or too weakly. Especially at highly transient phone boundaries, such as consonant-vowel transitions, continuity constraints often cause tracking errors [4], [5], [6].

Fortunately, in the TTS system construction process, transcriptions of the speech utterances are available. During speech corpus recording, a speaker is asked to read a set of carefully selected texts. From the text, the phonemic transcription can be generated automatically, and the transcription can then be time aligned with the acoustic speech signal using signal processing techniques. Using this forced time alignment, the exact time stamp for each phonemic event can be obtained.

In this paper, we test a new algorithm for tracking speech formant trajectories using segmental phonemic information. Given a speech interval, it is assumed that the phonemic identity and the nominal formant values for the phoneme are available; this assumption always holds in TTS applications. The implementation is based on previous work [7] in which only continuity constraints were used. We will show how much improvement can be achieved by using phonemic information for formant tracking.

ALGORITHM

The formant tracking algorithm consists of three phases: (1) analysis, (2) segmentation/alignment, and (3) formant track selection.
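To make the selection idea concrete before the per-phase details below, here is a minimal sketch of phase (3). The nominal formant table, the cost weights, and the greedy left-to-right search are illustrative assumptions; the paper states only that a minimum-cost combination close to the phoneme's nominal formants and satisfying continuity constraints is chosen.

# Minimal sketch of formant track selection (not the paper's implementation).
# Nominal values, weights, and the greedy search are illustrative assumptions.
from typing import Dict, List, Optional, Sequence, Tuple

# Hypothetical nominal F1/F2/F3 targets (Hz) for a few phonemes.
NOMINAL: Dict[str, Tuple[float, float, float]] = {
    "iy": (270.0, 2290.0, 3010.0),
    "aa": (730.0, 1090.0, 2440.0),
    "uw": (300.0, 870.0, 2240.0),
}

W_NOMINAL = 1.0      # weight on distance to the phoneme-dependent nominal value (assumed)
W_CONTINUITY = 2.0   # weight on frame-to-frame continuity (assumed)

def track_formants(candidates: List[Sequence[float]],
                   frame_phonemes: List[str]) -> List[Tuple[float, ...]]:
    """Greedily pick (F1, F2, F3) per frame from LPC-derived candidate frequencies (Hz).

    candidates[t]     -- candidate formant frequencies for frame t (from root solving)
    frame_phonemes[t] -- phoneme label for frame t (from the forced time alignment)
    """
    tracks: List[Tuple[float, ...]] = []
    prev: Optional[Tuple[float, ...]] = None
    for cands, phone in zip(candidates, frame_phonemes):
        chosen: List[float] = []
        for i, f_nominal in enumerate(NOMINAL[phone]):
            def cost(f: float) -> float:
                c = W_NOMINAL * abs(f - f_nominal)
                if prev is not None:
                    c += W_CONTINUITY * abs(f - prev[i])   # continuity with previous frame
                return c
            pool = [f for f in cands if f not in chosen]
            if not pool:                                   # too few candidates: fall back
                pool = [prev[i] if prev is not None else f_nominal]
            chosen.append(min(pool, key=cost))
        prev = tuple(sorted(chosen))                       # enforce F1 < F2 < F3 ordering
        tracks.append(prev)
    return tracks

In the full algorithm, a cost of this kind is presumably minimized over whole segments (for example by dynamic programming) rather than greedily frame by frame; the sketch only shows how nominal values and continuity can be traded off in one cost function.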
In the analysis phase, formant candidates are obtained by LPC analysis of pre-emphasized speech; the candidates are the roots of the linear prediction polynomial. In the segmentation phase, the input text is converted into a sequence of phonemic symbols, and the phonemic symbols are time aligned with the speech utterance. Finally, in the formant tracking phase, the best combination of formant frequencies is selected from the candidates based on a minimum-cost criterion: for each analysis frame, we choose the set of formant candidates that is closest to the nominal formant estimates while satisfying the continuity constraints.

Speech Analysis

Autocorrelation LPC analysis is performed on the pre-emphasized speech. An LPC order of 12 is used for speech data collected at a sampling rate of 11.025 kHz. Thus, ten complex poles (five conjugate pairs) model the five formants, and the extra two poles account for spectral tilt that might not have been compensated for by the pre-emphasis process. Pitch-asynchronous LPC coefficients are calculated every 5 ms, and a 25 ms Hamming window is applied to each analysis frame. Formant frequency candidates are calculated by solving for the roots of the prediction polynomial.
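As a concrete illustration of this analysis phase, the sketch below (a plain numpy implementation, not the authors' code) runs 12th-order autocorrelation LPC on 25 ms Hamming-windowed, pre-emphasized frames every 5 ms at 11.025 kHz and converts the upper-half-plane roots of the prediction polynomial into frequency/bandwidth candidates. The pre-emphasis factor and the candidate pruning thresholds are assumptions, not values from the paper.

# Minimal sketch of the analysis phase (not the authors' code).
import numpy as np

FS = 11025                 # sampling rate in Hz (as stated above)
LPC_ORDER = 12             # LPC order (as stated above)
FRAME = int(0.025 * FS)    # 25 ms analysis window
STEP = int(0.005 * FS)     # 5 ms frame shift

def lpc_autocorr(frame: np.ndarray, order: int) -> np.ndarray:
    """Autocorrelation-method LPC via the Levinson-Durbin recursion.

    Returns prediction coefficients a[0..order] with a[0] = 1, i.e. A(z) = sum_k a[k] z^-k.
    """
    r = np.correlate(frame, frame, mode="full")[len(frame) - 1:len(frame) + order]
    a = np.zeros(order + 1)
    a[0] = 1.0
    err = r[0] + 1e-9                                    # small floor for silent frames
    for i in range(1, order + 1):
        k = -(r[i] + np.dot(a[1:i], r[i - 1:0:-1])) / err   # reflection coefficient
        a[1:i + 1] = a[1:i + 1] + k * a[i - 1::-1][:i]
        err *= 1.0 - k * k
    return a

def formant_candidates(a: np.ndarray, fs: float = FS, max_bw: float = 400.0):
    """Map LP polynomial roots to sorted (frequency, bandwidth) candidates in Hz."""
    roots = np.roots(a)
    roots = roots[np.imag(roots) > 0]                # one root per complex-conjugate pair
    freqs = np.angle(roots) * fs / (2.0 * np.pi)
    bws = -np.log(np.abs(roots)) * fs / np.pi        # approximate 3 dB bandwidth
    keep = (freqs > 90.0) & (bws < max_bw)           # drop implausible poles (assumed limits)
    return sorted(zip(freqs[keep], bws[keep]))

def analyze(speech: np.ndarray, alpha: float = 0.97):
    """Per-frame formant candidate extraction; alpha is an assumed pre-emphasis factor."""
    pre = np.append(speech[0], speech[1:] - alpha * speech[:-1])
    window = np.hamming(FRAME)
    return [formant_candidates(lpc_autocorr(pre[s:s + FRAME] * window, LPC_ORDER))
            for s in range(0, len(pre) - FRAME + 1, STEP)]

Each retained root z maps to a candidate frequency fs*angle(z)/(2*pi) and an approximate 3 dB bandwidth -fs*ln|z|/pi; poles with near-zero frequency or very wide bandwidth are discarded here because they typically model spectral tilt rather than formants.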
متن کاملذخیره در منابع من
با ذخیره ی این منبع در منابع من، دسترسی به آن را برای استفاده های بعدی آسان تر کنید
Published in: Proceedings of the 6th European Conference on Speech Communication and Technology (EUROSPEECH'99), Budapest, Hungary, September 5-9, 1999.